Skip to content

V2 rewrite (beta): Support Spark Connect#254

Open
chenliu0831 wants to merge 17 commits into
masterfrom
v2_rewrite
Open

V2 rewrite (beta): Support Spark Connect#254
chenliu0831 wants to merge 17 commits into
masterfrom
v2_rewrite

Conversation

@chenliu0831

@chenliu0831 chenliu0831 commented Jan 13, 2026

Copy link
Copy Markdown
Contributor

Issue #, if available:

Description of changes:

This PR introduces PyDeequ 2.0 beta, a major release that replaces the Py4J-based architecture with Spark Connect for client-server communication.

The Deequ side change will be opened separately. the proto file here is copied for review purpose. For ease of testing, I created a pre-release https://github.com/awslabs/python-deequ/releases/tag/v2.0.0b1 to host the jars/wheels.

Motivation

The legacy PyDeequ relied on Py4J to bridge Python and the JVM, which had several limitations:

  • Required local Spark session with JVM access
  • Python lambdas couldn't be serialized for remote execution
  • Tight coupling between Python client and JVM made debugging difficult

Spark Connect (introduced in Spark 3.4) provides a clean gRPC-based protocol that solves these issues.

Code Changes

  • New pydeequ/v2/ module with Spark Connect implementation:

    • checks.py - Check and constraint builders
    • analyzers.py - Analyzer classes
    • predicates.py - Serializable predicates (eq, gte, between, etc.)
    • verification.py - VerificationSuite and AnalysisRunner
    • proto/ - Protobuf definitions and generated code
  • New test suite in tests/v2/:

    • test_unit.py - Unit tests (no Spark required)
    • test_analyzers.py - Analyzer integration tests
    • test_checks.py - Check constraint tests
    • test_e2e_spark_connect.py - End-to-end tests
  • Updated documentation:

    • Merged README with 2.0 quick start guide
    • Added architecture diagram
    • Migration guide from 1.x to 2.0

API Changes

# Before (1.x)
from pydeequ.checks import Check, CheckLevel
check.hasSize(lambda x: x == 3)

# After (2.0)
from pydeequ.v2.checks import Check, CheckLevel
from pydeequ.v2.predicates import eq
check.hasSize(eq(3))

Testing

More details see https://github.com/awslabs/python-deequ/blob/v2_rewrite/README.md.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment thread pydeequ/v2/verification.py Outdated
Comment thread pydeequ/v2/verification.py Outdated

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 8c93b14f) — may not be fully accurate. Reply if this doesn't help.

Comment thread pydeequ/__init__.py
Comment thread pydeequ/__init__.py
Comment thread tests/conftest.py
Comment thread pydeequ/v2/spark_helpers.py
Comment thread pyproject.toml
Comment thread pyproject.toml Outdated
Comment thread .github/workflows/base.yml
Comment thread pydeequ/v2/profiles.py Outdated
Comment thread pydeequ/v2/suggestions.py
Comment thread pydeequ/v2/checks.py Outdated
- Fix NameError in tests/conftest.py setup_pyspark shim
- Drop duplicate PyDeequSession entry in __getattr__ tuple
- Widen pyspark pin to >=3.5.0,<4.0.0; remove empty extras table
- Bump black target_version to py39 to match python requirement
- Raise ValueError on Check.where() before any constraint
- Replace fixed sleep with port-wait for Spark Connect server in CI

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

# Conflicts:
#	.github/workflows/base.yml
#	poetry.lock
#	pyproject.toml

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

Implements Stage 1 of the protobuf architecture review (see deequ repo
docs/adr/0001..0005). The wire format moves from a string-discriminator
ConstraintMessage/AnalyzerMessage to typed `oneof` arms — one per Check
or Analyzer builder method — with reused `…Spec` payload submessages.

Schema-driven builder rewrites:
- pydeequ/v2/predicates.py — Predicate uses CompareOp enum; field
  renamed `operator` → `op`.
- pydeequ/v2/checks.py — each Check builder method (isComplete,
  hasPattern, isContainedIn, isLessThan, …) populates the matching
  oneof arm directly. The old _add_constraint(constraint_type, …)
  shape-shifter is gone.
- pydeequ/v2/analyzers.py — same pattern. Compliance gets its own
  ComplianceAnalyzerSpec (instance/predicate, no longer mislabeled as
  column/pattern). Pair-column constraints use named column_a/column_b.
- pydeequ/v2/suggestions.py — Rules enum values map to ConstraintRuleSet
  proto enum values directly. testset_seed=0 is now a legal user choice
  (proto3 `optional`).
- pydeequ/v2/profiles.py — `optional` low_cardinality_histogram_threshold.

Stub delivery (graphframes pattern, per ADR-0005):
- pydeequ/v2/proto/deequ_connect_pb2.py and _pb2.pyi are checked in.
  Refresh by running `DEEQU_PROTO_PATH=… python scripts/regen_proto.py`
  whenever the deequ schema changes; intermediate .proto is gitignored.
- scripts/regen_proto.py is a developer convenience, not a build hook.
- pyproject.toml gains grpcio-tools as a dev dependency for the regen
  script.
- .github/workflows/base.yml gains a (currently disabled) drift-check
  step that will activate once the matching deequ JAR ships .proto under
  META-INF/protobuf/.

Cross-cutting:
- WIRE_FORMAT_VERSION constant + field dropped from runners and proto
  __init__.py per ADR-0005 (Spark Connect's RelationPlugin contract uses
  the protobuf type URL as the version discriminator).
- tests/v2/test_unit.py rewritten to assert msg.WhichOneof("body")
  shape; 11 unit tests all pass.

Verified via end-to-end smoke test (Spark 3.5 Connect server + freshly
built deequ JAR + tutorials/data_quality_example_v2.py): 9 analyzers, 12
constraints (1 expected failure on duplicate detection), 10-column
profile, 19 constraint suggestions all produce correct DataFrames.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants